Zipf's law against the text size: a half-rational model
نویسنده
چکیده
In this article, we consider Zipf-Mandelbrot law as applied to texts in natural languages. We present a simple model of dependence of the law on the text size, which is featured by variable power-law tail and constant ratio of the most frequent words. As a result we derive several closed formulas, which accord with empirical data qualitatively and partially quantitatively. For example, there appears to be a minimal length of literary texts equal to ≈ 159 word tokens for English.
منابع مشابه
A Stochastic Process for Word Frequency Distributions
A stochastic model based on insights of Mandelbrot (1953) and Simon (1955) is discussed against the background of new criteria of adequacy that have become available recently as a result of studies of the similarity relations between words as found in large computerized text corpora. FREQUENCY DISTRIBUTIONS Various models for word frequency distributions have been developed since Zipf (1935) ap...
متن کاملOn the Applicability of Zipf's Law in Chinese Word Frequency Distribution
Zipf's Law uncovers the relationship between word frequency and its rank. This paper addresses applicability of Zipf's Law in Chinese word frequency distribution. The previous studies on Zipf’s law in Chinese were primarily based on raw corpus, without word segmentation, hence there are obvious limitations. This study investigates the topic in several large-scale POS-tagged Chinese corpora. The...
متن کاملComments on "linguistic features in eukaryotic genomes"
Tsonis and Tsonis [1] study rank-ordered distributions of the number of occurrences of protein domains in four different organisms, and they argue that the power-law decay, f ϰ 1/r, of the number f of occurrences of a protein domain with its rank r suggests the presence of linguistic features in eukaryotic genomes, and that this finding " may lead to important clues about the evolution of langu...
متن کاملMultilingual Statistical Text Analysis, Zipf's Law and Hungarian Speech Generation
The practical challenge of creating a Hungarian e-mail reader has initiated our work on statistical text analysis. The starting point was statistical analysis for automatic discrimination of the language of texts. Later it was extended to automatic re-generation of diacritic signs and more detailed language structure analysis. Parallel study of three different languages Hungarian. German and En...
متن کاملZipf's Law Leads to Heaps' Law: Analyzing Their Relation in Finite-Size Systems
BACKGROUND Zipf's law and Heaps' law are observed in disparate complex systems. Of particular interests, these two laws often appear together. Many theoretical models and analyses are performed to understand their co-occurrence in real systems, but it still lacks a clear picture about their relation. METHODOLOGY/PRINCIPAL FINDINGS We show that the Heaps' law can be considered as a derivative ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Glottometrics
دوره 4 شماره
صفحات -
تاریخ انتشار 2002